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Abstract 

In recent years, longitudinal family-based studies have had success in identifying genetic variants that influence 
complex traits in genome-wide association studies. In this paper, we suggest that longitudinal analyses may contain 
valuable information that can enable identification of additional associations compared to baseline analyses. Using 
Genetic Analysis Workshop 18 data, consisting of whole genome sequence data in a pedigree-based sample, we 
compared 3 methods for the genetic analysis of longitudinal data to an analysis that used baseline data only. These 
longitudinal methods were (a) longitudinal mixed-effects model; (b) analysis of the mean trait over time; and (c) a 
2-stage analysis, with estimation of a random intercept in the first stage and regression of the random intercept on a 
single-nucleotide polymorphism at the second stage. All methods accounted for the familial correlation among 
subjects within a pedigree. The analyses considered common variants with minor allele frequency above 5% on 
chromosome 3. Analyses were performed without knowledge of the simulation model. The 3 longitudinal methods 
showed consistent results, which were generally different from those found by using only the baseline observation. 
The gene CACNA2D3, identified by both longitudinal and baseline approaches, had a stronger signal in the 
longitudinal analysis (p = 2.65 x 1CT 7 ) compared to that in the baseline analysis (p = 2.48 x 1CT 5 ). The effect size of 
the longitudinal mixed-effects model and mean trait were higher compared to the 2-stage approach. The 
longitudinal results provided stable results different from that using 1 observation at baseline and generally had 
lower p values. 



Background 

Longitudinal data analyses are widely used in genome- 
wide association studies to assess genetic and environ- 
mental risk factors and their association with phenotypes 
of interest [1-3]. They are more complicated than ana- 
lyses using only baseline measures because subjects are 
followed over time and change is measured during fol- 
low-up. Standard linear regression techniques are not 
applicable in this setting because of the correlation that 
exists among the repeated measures per subject. Methods 
for longitudinal study designs have enabled the investiga- 
tion of genetic variation influencing trait values over time 
[3]. In Genetic Analysis Workshop 13, Gauderman et al 
[4] provided an overview of a wide range of methods for 
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the genetic analysis of longitudinal data in families. They 
summarized these methods into 2 groups: (a) 2-stage 
approaches, in which a summary statistic is obtained and 
used in genetic analysis, and (b) joint modeling, in which 
the genetic and longitudinal data are analyzed simulta- 
neously in a single analysis. They argued that the use of a 
mean-type statistic could provide greater power com- 
pared to a slope-type statistic for detecting a gene effect. 
Zhu et al [1] performed a genome-wide association in 
which they identified genes and gene-environment inter- 
actions associated with longitudinal traits. They imple- 
mented a multivariate adaptive spline for the analysis of 
the longitudinal data. 

In this paper, our main object is to compare existing 
methods of longitudinal data analyses with those that use 
only 1 baseline measure in association studies. We 
explore the following longitudinal methods: (a) a longitu- 
dinal mixed-effects model; (b) analysis of the mean trait 
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over time; and (c) a 2-stage analysis, with estimation of a 
random intercept in the first stage and regression of the 
random intercept on a single-nucleotide polymorphism 
(SNP) in the second stage. These longitudinal methods 
use statistics that capture the level of a trait, such as a 
mean, to detect genetic associations as opposed to meth- 
ods that focus on the change in the trait over time, such 
as a slope. Despite the strengths and integrated approach 
of a longitudinal mixed model, its implementation is very 
computer-intensive because of its complex structure. 
Therefore, the main motivation for trying some "simpler" 
alternative longitudinal models, such as analysis of the 
mean trait over time and a 2-stage analysis, is to see if 
they can serve as good substitutes with equally good 
performance. 

Methods 

Study subjects and phenotype 

We used real phenotype data collected in the San Antonio 
Family Heart Study, including sex, age, year of examina- 
tion, systolic and diastolic blood pressure, use of antihy- 
pertensive medications, and tobacco smoking at up to 4 
time points for 939 subjects in 20 pedigrees. Of the 939 
participants, 244 attended only 1 exam; for the remaining 
subjects, the median follow-up time was 11 years with a 
median gap time between assessments of 5 years. We ana- 
lyzed 2 continuous traits: systolic blood pressure (SBP) 
and diastolic blood pressure (DBP). For participants on 
medication, we imputed both SBP and DBP to mimic 
what their unmedicated values would be. If a subject was 
on medications at an exam, we imputed the blood pres- 
sure at this exam to be the average blood pressure of all 
observations with higher values among those of the same 
gender and ± 10 years of the age of the subject. We per- 
formed a preliminary analysis to select covariates for both 
SBP and DBP. Variables significantly associated (p < 0.05) 
with SBP or DBP were selected. For SBP, we adjusted for 
age, sex, and tobacco smoking. For DBP, we adjusted for 
age, sex, tobacco smoking, and centered age squared. 

Genetic data 

The genetic data from Genetic Analysis Workshop 18 
(GAW18) consisted of whole genome sequence data in 
a pedigree-based sample with longitudinal phenotype 
data for hypertension and related traits. A total of 26.8 
million SNPs were identified in the 483 individuals. 
After eliminating 19 outlier individuals who failed to 
meet SNP quality control criteria such as fractions and 
ratio of homogeneous and heterogeneous sites and frac- 
tion of novel SNPs, 24 million SNPs passed support vec- 
tor machine and indel proximity filters. Genotype calls 
cleaned of mendelian errors and dosages were provided 
for 959 individuals (464 directly sequenced and the rest 
imputed) for 8,348,674 locations in the genome. 



A majority of the SNPs were rare variants; 51% had a 
minor allele frequency (MAF) below 1%. As suggested 
by GAW18 leaders, all analyses for this current paper 
were based on 402,985 common variants (MAF >5%) of 
chromosome 3 only, accounting for around one-third of 
the total number of variants on the chromosome. 

Statistical analyses 
Baseline association analysis 

For comparison with the methods that used the longitudi- 
nal data, we applied a baseline association analysis that con- 
sidered only the first observation (baseline) for each person. 
In addition to adjusting for covariates, we incorporated a 
familial correlation structure (kinship coefficient matrix) 
into the model as Yyo = A) + Xy 0 ft + PsSNPy + ay + By, 
where i denotes the i pedigree, and /' denotes the ;' th indivi- 
dual in the i pedigree. For this individual, Yy Q denotes the 
phenotype at baseline, Xp = (Xyoi, . . . ,X,jo m ) denotes the 
covariates at baseline, and SNPy denotes the SNP dosage. 
/S 0 is the fixed intercept, = [f)i, f} 2 , ■ . ■ , Pm)' is a vector of 
regression coefficients for the m covariates, and f} s is the 
SNP effect size; a ij is the random intercept for the per- 
son. Within each pedigree, the vector a* = [an, . . . , a I)Tj ) 
is normally distributed with a mean of 0 and a covariance 
matrix of a 2 Y.ki„ ( tne kinship matrix), contributing a diago- 
nal block for each pedigree to the overall covariance matrix; 
e ij is an error term with a mean of 0 and a variance of <r E 2 . 
This model was implemented using the lmekin package in 
R (version 2.9.2) package "kinship" [5], which employed 
maximum likelihood methods to estimate parameters. 

The notations of /So, /S, p s ,SNPy,ay,Sij used in this 
baseline model apply to the following models where 
applicable. 

To compare with the baseline approach, we consid- 
ered 3 approaches for longitudinal analyses of these 
data: (a) longitudinal mixed-effects association analysis, 
(b) mean measure in longitudinal association analysis, 
and (c) 2-stage longitudinal association analysis. 
Longitudinal mixed-effects association analysis 
We used a random-intercept mixed effects model with 
familial correlation structure [7]. The model is: 

Y ijt = fa + Xy t/ S + frSNPy + ay + Sy (1) 

Here i denotes the i* pedigree, and / denotes the 7 th indi- 
vidual in the i th pedigree. For this individual, Yy t denotes 
the trait at time point t; Xy t = [Xy t i,Xy t2 , . . . ,Xy tm ) denotes 
the covariates at time t, including time-dependent covari- 
ates. This model was implemented in the R (version 2.15.1) 
package "pedigreemm" [6], which used the method of 
restricted maximum likelihood for parameter estimation. 
Mean measure in longitudinal association analysis 
We also considered the mean across all time points as 
the trait and its corresponding averaged covariates as 
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one alternative for longitudinal association analysis. This 
model is: 

Y* = p 0 + X* 0 + frSNPij + a tj + (2) 

Here i denotes the i th pedigree, and /' denotes the /' indi- 
vidual in the f 1 pedigree. For this individual, Yjjj denotes the 
mean trait across time. X*j = (X| 1( X| 2 , ■ ■ ■ <X| m ) denotes 
the covariates, which for time-dependent covariates is the 
average measure across time. 

This model was implemented using the function lme- 
kin in R (version 2.9.2) package "kinship" [5], using 
maximum likelihood methods to estimate parameters. 
Two-stage longitudinal association analysis 
Another longitudinal approach employs a 2-stage strat- 
egy [4]. In the first stage, a random intercept, as the 
level of the trait for each person was generated from a 
growth curve model: 

Yijt = Pw + Xij t jS + Qfij + £ij (4) 

Here i denotes the z' th pedigree, and /' denotes the /* indi- 
vidual in the i pedigree. For this individual, Y;j t denotes 
the trait at time point t. X;j t = (X;j t i, X;j t 2, . . . ,Xij tm ) 
denotes the covariates including time-dependent covariates. 
/3 io is the fixed intercept of the first stage; a ij is the random 
intercept. As above, the covariance structure of a ij is 
a 2 Skin/ which contributes a diagonal block for each pedi- 
gree to the overall covariance matrix. 

In the second stage, random intercept a ij is treated as 
the "new" trait and regressed on a SNP as follows: 

oiij = P20 + PsSNPij + yij + Sij (5) 

Here SNPjj denotes the SNP dosage. P20 is the intercept 
of the second stage; fi s is the SNP effect size; e y is an 
error term with a mean of 0 and a variance of of yij is the 
random intercept that adjusts for the familiar correlation 
of a ij; and, similarly, the vector Yi = [yn> • • • < Yin,) is nor- 
mally distributed with a mean of 0 and a covariance 
matrix of a 2 contributing a diagonal block for each 
pedigree to the overall covariance matrix. 

Gauderman et al [4] pointed out that a mean-based 
statistic is more powerful to detect a genetic association 
than a slope-based statistic (eg, a random slope). So 
here we adopted the random intercept of the first stage 
rather than the random slope as the "trait" in the second 
stage. The first-stage model was implemented using 
lmekin of the R (version 2.15.1) package "coxme" [6], 
which could handle more than 1 random effect; the sec- 
ond-stage model was implemented using lmekin of the 
R (version 2.9.2) package "kinship"[5], which adopted a 
faster computing algorithm. Both packages used maxi- 
mum likelihood in parameter estimation. 



Power and type I error 

We conducted power calculations for all 4 methods and 
evaluated type I error by means of the genomic control 
value. We chose the variant (chromosome 3: 47956424) 
on gene MAP4, the top variant influencing simulated 
SBP and DBP, as the functional variant for power calcu- 
lations. To determine power, we tested the null hypoth- 
esis that the trait SBP was not associated with the 
functional variant, versus the alternative hypothesis that 
it is associated. Therefore, results would be considered 
statistically significant if the p value obtained using the 
analysis methods fell below a predetermined threshold. 
Here we divided the significance level 0.05 by the 
approximate number (25,676) of independent SNPs on 
chromosome 3 to adjust for multiple testing. We used 
PLINK (http://pngu.mgh.harvard.edu/~purcell/plink/) 
[8] to prune out SNPs on chromosome 3 where the 
pairwise linkage disequilibrium was 0.2 or greater, and 
25,676 SNPs remained. For each of the 4 methods, the 
estimated power was the proportion of replicates in 
which the method detected a significant association 
between the trait and the functional variant. 

For each of the 4 methods, genomic control value was 
used to assess the extent of the inflation of type I error, 
based on the p value of common variants on chromo- 
some 3. 

Results 

Association analysis of real data 

For SBP, there were no shared results in the top 10 hits 
between the baseline approach and the other 3 longitudi- 
nal methods (Table 1). Some shared genes identified by 
the longitudinal methods were FGF12 and FHIT. The 
mean measure and 2-stage methods yielded similar results. 
For DBP, the 3 longitudinal methods yielded consistent 
results (as shown in Figure 1, right side): the top 10 hits 
came from the same gene (CACNA2D3 in Table 2; eg, 
SNP 3_54748234 has a p value of 2.65 x 10~ 7 ), with SNPs 
nearly reaching a Bonferroni significance threshold. This 
gene was also found using the baseline method but was 
less significant (rank = 2, p = 2.76E-05 in Table 2). 

Power and type I error 

Power was computed to assess the baseline method and 
the 3 longitudinal methods (Table 3). The 3 longitudinal 
methods had at least 10.5% higher power than the base- 
line method. Among the longitudinal methods, the 
power of both mean measure and 2-stage methods was 
comparable (41% and 40.5%, respectively) and substan- 
tially higher than that of the linear mixed-effects (LME) 
method (32.5%). None of the 4 methods showed ele- 
vated type I error because the genomic control value 
ranged from about 0.98 to 1.034. 



Table 1 TOP 10 hits of SBP on chromosome 3 across the baseline method and the 3 longitudinal methods 



Baseline 










Longitudinal 










Mean measure 








Two-stage 










SNP 


Effect 
size 


SE 


P 


Closest* 
genes 


SNP 


Effect 
size 


SE 


P 


Closest* 
genes 


SNP 


Effect 
size 


SE 


P 


Closest* 
genes 


SNP 


Effect 
size 


SE 


P 


Closest* 
genes 


3_1 49871 159 


531 


1.22 


1.63E- 
05 


LOC646903, 


3_1 062201 30 


-4.36 


1.02 


2.16E- 

05 


LOCI 00302640 


3_1 062201 30 


-4.68 


1.05 


9.16E- 
06 


BCHE 


3_1 65046920 


2.96 


0.68 


1.33E- 
05 


5LITRK3 


3_1 331 6091 1 


4.02 


0.93 


1.83E- 
05 


BFSP2 


3_1 13652027 


3.77 


0.89 


2.34E- 
05 


GRAMD1C 


3_1 06220437 


-4.65 


1.05 


1 .06 E- 

05 


FGF12 


3_59966975 


2.52 


0.58 


1.59E- 

05 


FHIT 


3_1 498942 19 


5.10 


1.22 


3.41 E- 
05 


LOC646903, 


3J06217172 


-4.33 


1.03 


2.57E- 

05 


LOC100302640 


3J06217172 


-4.66 


1.05 


1 .07E- 
05 


FGF12 


3_1 92240010 


-3.67 


0.86 


2.14E- 
05 


FGF12 


3_1 22390279 


5.24 


1.26 


3.65E- 
05 


PARP14 


3_1 65046920 


4.21 


1.00 


2.82E- 
05 


SLITRK3 


3_1 0621 8053 


-4.62 


1.07 


1 .64E- 
05 


FHIT 


3_1 9223981 5 


-3.81 


0.90 


2.31 E- 
05 


FGF12 


3_571 73021 


3.87 


0.93 


3.75 E- 
05 


IL17RD 


3_1 06220437 


-4.29 


1.02 


2.89E- 
05 


LOCI 00302640 


3_1 0621 9390 


-4.62 


1.07 


1 .64E- 
05 


DOCK3 


3_50996289 


4.18 


0.99 


2.55E- 
05 


DOCK3 


3_1 20020489 


3.73 


0.90 


3.95E- 
05 


LRRC58 


3_59966975 


3.58 


0.86 


3.15E- 
05 


FHIT 


3_1 06231 571 


-4.54 


1.05 


1.71 E- 

05 


BCHE 


3_1 65049274 


2.75 


0.66 


3.35E- 
05 


SLITRK3 


3_1 20023242 


3.73 


0.90 


3.95E- 
05 


LRRC58 


3 192239815 


-5.47 


1.31 


3.20E- 
05 


FGF12 


3_1 06232849 


-4.51 


1.05 


1 .86E- 
05 


BCHE 


3_1 65046402 


2.71 


0.66 


4.21 E- 
05 


SLITRK3 


3_1 081 88993 


-3.77 


0.91 


4.00E- 

05 


MYH15 


3_1 92240010 


-5.28 


1.27 


3.27E- 
05 


FGF12 


3_1 06220258 


-4.39 


1.02 


1.86E- 
05 


ZNF385D 


3_1 65053404 


2.63 


0.66 


6.83E- 
05 


5LITRK3 


3_1 40640076 


7.59 


1.84 


4.08E- 
05 


SLC25A36 


3_72678387 


-4.30 


1.04 


3.38E- 
05 


SHQ1 


3_1 06220368 


-4.45 


1.05 


2.41 E- 
05 


MYH15 


3_21 520730 


4.60 


1.16 


7.82E- 
05 


ZNF385D 



3_1 58228266 -4.87 1.18 4.31 E- RSRC1 3_1 0621 8053 -4.28 1.04 4.20E- LOC100302640 3_72678387 -4.53 1.07 2.56E- BCHE 3_1 13652027 2.40 0.60 7.94E- GRAMD1C 

05 05 05 05 



•In the field "closest genes"; a bold gene name indicates that the SNP on the same row is right on that gene. 



Wang et al. BMC Proceedings 2014, 8(Suppl 1):S84 
http://www.biomedcentral.eom/1753-6561/8/S1/S84 



Page 5 of 7 



Baseline association study of SBP on Chr3 



5 CO — 





1 



i 1 1 r 

Position 



"I 1 



Longitudinal association study of SBP on Chr3 




Position 

Mean measure longitudinal study of SBP on Chr3 

:• I 

, — i— ■ 



: hi . J.-. 




i. 




Position 



Two-stage longitudinal study of SBP on Chr3 



I 

1 - 
a 

i 





— i 1 1 1 1 

4000MOO oooooooo iiooeoooo ioooooooo torwoooo 



Baseline association study of DBP on Chr3 



1 

5 n 



it • li ii' iiiu 




•Lt»J 

> t' >3 




— i 1 

MtOMG "i.'u-mi 



Position 



Longitudinal association study of DBP on Chr3 




oooooooo to^ n woo 



Mean measure longitudinal study of DBP on Chr3 





Position 



Two-stage longitudinal study of DBP on Chr3 



s 

5 c -t 

3 

i 



i ; 1 1 





T T 1 

II0O0O00O 1*0000000 i«r*WM0 



Position Position 

Figure 1 Manhattan plots on chromosome 3 using both the baseline and 3 longitudinal methods for SBP and DBP. 



Discussion and conclusions methods were more computer efficient than the LME 

For both traits, the genes identified by the 3 longitudinal method. Furthermore, these 2 longitudinal methods 

methods were consistent, but different from those found were more powerful than the LME method. These 2 

with the baseline approach. From the perspective of methods can act as efficient and powerful "substitutes" 

computational time, the mean measure and 2-stage for LME. The mean measure method worked as well as 



Table 2 TOP 10 hits of DBP on chromosome 3 across the baseline method and the 3 longitudinal methods 



Baseline 










Longitudinal 








Mean-measure 








Two-stage 










SNP 


Effect 
size 


SE 


P 


Closest* 
genes 


SNP 


Effect 
size 


SE 


P 


Closest* 
genes 


SNP 


Effect 
size 


SE 


P 


Closest* 
genes 


SNP 


Effect 
size 


SE 


P 


Closest* 
genes 


3_1 64797024 


4.98 


1.17 


2.48E-05 


SI 


3_54748234 


2.20 


0.43 


2.65E- 
07 


CACNA2D3 


3_54748234 


2.23 


0.44 


3.73E- 
07 


CACNA2D3 


3_54757032 


1.10 


0.21 


1 .83E- 
07 


CACNA2D3 


3_54748368 


-2.15 


0.51 


2.76E-05 


CACNA2D3 


3_54757032 


2.21 


0.43 


3.21 E- 
07 


CACNA2D3 


3_54757032 


2.22 


0.44 


5.77E- 
07 


CACNA2D3 


3_54748234 


1.11 


0.21 


2.39E- 
07 


CACNA2D3 


3J24142019 


2.85 


0.69 


4.62E-05 


KALRN 


3_54748368 


-2.15 


0.42 


3.84E- 
07 


CACNA2D3 


3_54748368 


-2.17 


0.43 


6.80E- 
07 


CACNA2D3 


3_54793253 


1.05 


0.21 


6.86E- 
07 


CACNA2D3 


3_1 861 44694 


-3.17 


0.78 


5.83E-05 


LOC253573 


3_54784952 


2.11 


0.43 


7.90E- 
07 


CACNA2D3 


3_54793253 


2.09 


0.43 


1.67E- 
06 


CACNA2D3 


3_54799449 


1.05 


0.21 


7.79E- 
07 


CACNA2D3 


3_38845381 


-3.68 


0.91 


6.06E-05 


SCN10A 


3_54793253 


2.08 


0.42 


9.59E- 
07 


CACNA2D3 


3_54784952 


2.09 


0.44 


2.05E- 
06 


CACNA2D3 


3_54784952 


1.05 


0.21 


7.90E- 
07 


CACNA2D3 


3_1 86209848 


-3.05 


0.78 


0.000107 


LOC253573 


3_54779240 


2.08 


0.43 


1 .04E- 

06 


CACNA2D3 


3_54799449 


2.09 


0.44 


2.31 E- 
06 


CACNA2D3 


3_54748368 


-1.05 


0.21 


8.1 5E- 

07 


CACNA2D3 


3_87619500 


2.32 


0.60 


0.000109 


POU1F1 


3_54756448 


-2.10 


0.43 


1 .07E- 
06 


CACNA2D3 


3_54756448 


-2.09 


0.44 


2.35 E- 
06 


CACNA2D3 


3_54779240 


1.03 


0.21 


9.61 E- 
07 


CACNA2D3 


3_1 77961 323 


-2.05 


0.53 


0.000115 


KCNMB2- 

m 


3_54756196 


2.06 


0.43 


1 .46E- 
06 


CACNA2D3 


3_54756196 


2.08 


0.44 


2.48E- 
06 


CACNA2D3 


3_54807320 


1.03 


0.21 


9.67E- 
07 


CACNA2D3 


3_72651668 


2.05 


0.53 


0.000132 


5HQ1 


3_54793450 


-2.07 


0.43 


1.57E- 
06 


CACNA2D3 


3_54747244 


2.07 


0.44 


2.65 E- 
06 


CACNA2D3 


3_54756196 


1.03 


0.21 


1 .26E- 
06 


CACNA2D3 



3J86149493 -3.04 0.79 0.000134 LOC253573 3_54740011 2.05 0.43 1.75E- CACNA2D3 3_5474001 1 2.07 0.44 2.67E- CACNA2D3 3_54799706 1.02 0.21 1.27E- CACNA2D3 

06 06 06 



•In the field "closest genes"; a bold gene name indicates that the SNP on the same row is right on that gene. 
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Table 3 Power calculation of all 4 methods (based on the 
200 simulations) 



Method 


Baseline 


Mean measure 


Two-stage 


LME 


Power 


22% 


41% 


40.5% 


32.5% 



the 2-stage method, identifying the same genes. The sig- 
nals found with the 2-stage method (third row of Man- 
hattan plot in Figure 1) were almost identical to those 
with the LME method, for both SBP and DBP. There- 
fore, we concluded that the mean measure and 2-stage 
methods were 2 efficient ways to analyze longitudinal 
data when the goal is to examine level of a trait. Only 
the longitudinal approach can evaluate associations with 
trends over time. 



4 

5 

6 

7 
8 
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